Can string kernels pass the test of time in Native Language Identification?
نویسندگان
چکیده
We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former classifier obtains better results than the latter one on the development set. In our previous work, we have used a similar machine learning approach to achieve stateof-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass the test of time and reach state-of-theart performance in the 2017 NLI shared task, despite the recent advances in natural language processing. We participated in all three tracks, in which the competitors were allowed to use only the essays (essay track), only the speech transcripts (speech track), or both (fusion track). Using only the data provided by the organizers for training our models, we have reached a macro F1 score of 86.95% in the closed essay track, a macro F1 score of 87.55% in the closed speech track, and a macro F1 score of 93.19% in the closed ∗ The authors have equally contributed to this work. fusion track. With these scores, our team (UnibucKernel) ranked in the first group of teams in all three tracks, while attaining the best scores in the speech and the fusion tracks.
منابع مشابه
String Kernels for Native Language Identification: Insights from Behind the Curtains
The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Recently, an approach that uses only character p-grams as features has been proposed for the task of native language identification (NLI). The approach obtained state-of-the-art results by combining several string kernels using...
متن کاملThe Story of the Characters, the DNA and the Native Language
This paper presents our approach to the 2013 Native Language Identification shared task, which is based on machine learning methods that work at the character level. More precisely, we used several string kernels and a kernel based on Local Rank Distance (LRD). Actually, our best system was a kernel combination of string kernel and LRD. While string kernels have been used before in text analysi...
متن کاملCan characters reveal your native language? A language-independent approach to native language identification
A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection,...
متن کاملTelephone Conversation Closing Strategies Used by Persian Speakers: Rapport Management Approach
The use of politeness strategies can help interlocutors promote and/or maintain social harmony in telephone interactions. Using the Rapport Management Model proposed by Spencer-Oatey (2008), this study aimed primarily to reinvestigate the closing structures of telephone conversation (hereafter abbreviated as TC) in Persian and to discover the common politeness strategies used by native Persian ...
متن کاملFast Kernel Methods for SVM Sequence Classifiers
In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithm for the mismatch kernel matrix computation improves cur...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2017